Abstract

Background: Protein sequence profile-profile alignment is an important approach to recognizing remote homologs 
and generating accurate pairwise alignments. It plays an important role in protein sequence database search, 
protein structure prediction, protein function prediction, and phylogenetic analysis. 

Results: In this work, we integrate predicted solvent accessibility, torsion angles and evolutionary residue coupling 
information with the pairwise Hidden Markov Model (HMM) based profile alignment method to improve 
profile-profile alignments. The evaluation results demonstrate that adding predicted relative solvent accessibility 
and torsion angle information improves the accuracy of profile-profile alignments. The evolutionary residue 
coupling information is helpful in some cases, but its contribution to the improvement is not consistent. 

Conclusion: Incorporating the new structural information such as predicted solvent accessibility and torsion angles 
into the profile-profile alignment is a useful way to improve pairwise profile-profile alignment methods. 



Background 

Pairwise protein sequence alignment methods have been
essential tools for many important bioinformatics tasks,
such as sequence database search, homology recognition,
protein structure prediction and protein function
prediction [1-5]. Following the development of global and
local methods for aligning two single sequences [6-8],
profile-sequence and profile-profile alignment methods
such as PSI-BLAST, SAM [9], HMMer [10], HHsearch and
HHsuite [4-6], which enrich two single sequences with
their homologous sequences, have substantially improved
both the sensitivity of recognizing remote homologs and
the accuracy of aligning two protein sequences.

Due to their relatively high sensitivity in recognizing
remote protein homologs, profile-profile alignment methods
have become the default structural template identification
method for many template-based protein structure
modeling methods and servers [11-14]. For instance,
HHsearch, one of the top profile-profile alignment tools
based on comparing the profile hidden Markov models
(HMMs) of two proteins, was used by almost all the
template-based protein structure prediction methods
tested during the last two Critical Assessment of
Techniques for Protein Structure Prediction (CASP)
experiments [15,16]. The open source package HHsuite
contains both the latest implementation of HHsearch,
which supports a full HMM-HMM alignment-based search
on an HMM profile database, and a very fast search tool,
HHblits [5], which reduces the number of unnecessary full
HMM pairwise alignments in order to drastically improve
search speed. Moreover, the maximum accuracy (MAC)
alignment algorithm is applied in HHsuite, but not in the
original HHsearch. In this work, we aim to introduce new
sources of information to improve profile-profile
alignments with respect to both the original HHsearch
package and the open source HHsuite package.

In order to align the structurally equivalent residues of
a target protein and a template protein more accurately,
secondary structure information was incorporated into
profile-profile sequence alignment methods, yielding
better sensitivity and accuracy [4,17]. Aiming to find
new sources of information to further improve the
sensitivity and accuracy of pairwise profile-profile
alignment, we examine the effectiveness of incorporating
into profile-profile alignment methods some new features
that have not been used in profile-profile alignments
before, including protein solvent accessibility, torsion
angles, and the evolutionary residue coupling
information [18,19].

Specifically, we add scoring terms for solvent
accessibility, torsion angles, and evolutionary residue
coupling information to the scoring function of
HHsuite [5] in order to enhance the alignment process.
According to our evaluation, adding solvent accessibility
and torsion angles improves the alignment accuracy,
but incorporating the evolutionary residue coupling
information is useful only in some cases.

Methods 

We extended an existing profile-profile alignment method
within the standard five-step alignment framework of
HHsuite [5] shown in Figure 1: discretization of
profile columns, removal of very short or very dissimilar
sequences, execution of the Viterbi alignment and
calculation of E-value and probability, realignment based
on the maximum accuracy (MAC) algorithm, and retrieval of
alignments by tracing back. Unlike HHsuite, our
method applies solvent accessibility and torsion angle
information to both the Viterbi alignment and the
maximum accuracy alignment, and traces back with the aid
of the evolutionary residue coupling information. In the
following sections, we focus on describing how to
incorporate the new features into the profile-profile
method (i.e., HHsuite), while briefly introducing the
necessary technical background.



Adding solvent accessibilities and torsion angles into the
Viterbi alignment

The score of aligning two columns in two protein
profiles (namely a query profile q and a template
profile t) in HHsuite is calculated according to
Equation (1):

$$S_{aa}(q_i, t_j) = \log \sum_{a=1}^{20} \frac{q_i(a)\, t_j(a)}{f(a)} \qquad (1)$$

in which $q_i(a)$ and $t_j(a)$ denote the probability of
amino acid $a$ at position $i$ in the query profile and at
position $j$ in the template profile, respectively, and
$f(a)$ is the background frequency of residue $a$
($a \in \{1, 2, \ldots, 20\}$, representing the 20 types of
amino acids). The best alignment between two profile HMMs
is obtained by maximizing the log-sum-of-odds score
$S_{LSO}$ according to Equation (2):

$$S_{LSO} = \sum_{k:\, X_k Y_k = MM} S_{aa}\big(q_{i(k)}, t_{j(k)}\big) + \log P_{tr} \qquad (2)$$

where $k$ denotes the index of the columns at which the
query HMM q is aligned to the template HMM t, $i(k)$ and
$j(k)$ are the respective columns in q and t, and $P_{tr}$
is the product of all transition probabilities for the
path through q and t. The latest version of HHsuite
includes the secondary structure information in the
calculation of the score. In this work, we further
augment the calculation of the score by adding terms that
account for the solvent accessibility and torsion angles.
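As an illustration of the column score and the path score described above, the following minimal sketch computes both for toy profile columns. It assumes the log-sum-of-odds form, i.e. the logarithm of the sum over amino acids of q(a) t(a) / f(a); the base-2 logarithm, the uniform background and the function names `column_score` and `path_score` are assumptions made for this example only.

```python
import math

def column_score(q_col, t_col, background):
    """Log-sum-of-odds score of two profile columns: log2(sum_a q(a) * t(a) / f(a))."""
    odds = sum(q * t / f for q, t, f in zip(q_col, t_col, background))
    return math.log2(odds)

def path_score(column_scores, transition_probs):
    """Column scores of the match-match pairs plus the log of the transition product."""
    return sum(column_scores) + sum(math.log2(p) for p in transition_probs)

# Toy data (hypothetical): uniform background over the 20 amino acids.
background = [1.0 / 20] * 20
conserved = [0.81] + [0.01] * 19   # column dominated by one residue type
uniform = [1.0 / 20] * 20          # uninformative column

print(column_score(conserved, conserved, background))  # positive: both columns favor the same residue
print(column_score(conserved, uniform, background))    # close to zero: no evidence either way
```

Two columns that favor the same residue receive a positive score, whereas pairing a conserved column with an uninformative one scores close to zero.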

Figure 1 The workflow of the HMM-based profile-profile pairwise alignment.
Flowchart boxes: sequences of target and its template, giving Profile 1
(target and homologs) and Profile 2; discretize profile columns into an
alphabet of 219 states; construct a Direct Information Matrix and retrieve
inferred residue coupling information for both query and template profiles;
prefilter the sequence profiles; calculate S-score and E-value based on
Viterbi alignment using secondary structure, solvent accessibility and
torsion angle information; maximum accuracy alignment based on structural
information and trace-back integrating inferred residue coupling
information; final pairwise sequence alignment between target and template.

The Viterbi dynamic programming algorithm uses five
matrices $S_{AB}$ (i.e., $AB \in \{MM, MI, IM, DG, GD\}$)
representing matching different states (M: match;
I: insertion; D: deletion; G: gap [4]) in two HMMs to
maximize the augmented log-sum-of-odds score $S_{LSO}$.
They are recursively calculated as:

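$$
S_{MM}(i,j) = S_{aa}(q_i,t_j) + w_{ss} S_{ss}(q_i,t_j) + w_{sa} S_{sa}(q_i,t_j) + w_{tors} S_{tors}(q_i,t_j)
+ \max \left\{ \begin{array}{l}
S_{MM}(i-1,j-1) + \log[q_{i-1}(M,M)\, t_{j-1}(M,M)] \\
S_{MI}(i-1,j-1) + \log[q_{i-1}(M,M)\, t_{j-1}(I,M)] \\
S_{IM}(i-1,j-1) + \log[q_{i-1}(I,M)\, t_{j-1}(M,M)] \\
S_{DG}(i-1,j-1) + \log[q_{i-1}(D,M)\, t_{j-1}(M,M)] \\
S_{GD}(i-1,j-1) + \log[q_{i-1}(M,M)\, t_{j-1}(D,M)]
\end{array} \right\} + S_{shift} \qquad (3)
$$

$$
S_{MI}(i,j) = \max \left\{ \begin{array}{l}
S_{MM}(i-1,j) + \log[q_{i-1}(M,M)\, t_j(M,I)] \\
S_{MI}(i-1,j) + \log[q_{i-1}(M,M)\, t_j(I,I)]
\end{array} \right\} \qquad (4)
$$

$$
S_{DG}(i,j) = \max \left\{ \begin{array}{l}
S_{MM}(i-1,j) + \log[q_{i-1}(M,D)] \\
S_{DG}(i-1,j) + \log[q_{i-1}(D,D)]
\end{array} \right\} \qquad (5)
$$

$S_{IM}(i,j)$ and $S_{GD}(i,j)$ are calculated similarly to
$S_{MI}(i,j)$ and $S_{DG}(i,j)$.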
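The recursions over the five pair-state matrices can be sketched in a few lines. The code below is a simplified illustration, not the HHsuite implementation: transition probabilities are position-independent (one dictionary per HMM instead of one per column), an alignment is allowed to start in any cell (the `0.0` option in the maximum, an assumption corresponding to local mode), base-2 logarithms are assumed, and the function name `viterbi_best_score` is hypothetical.

```python
import math

NEG_INF = float("-inf")
STATES = ("MM", "MI", "IM", "DG", "GD")

def viterbi_best_score(match_score, q_tr, t_tr, shift=-0.03):
    """Best local Viterbi score over the five pair-state matrices.

    match_score[i][j] is the weighted column score of query column i+1 and
    template column j+1. q_tr and t_tr map two-letter transitions such as
    "MM" or "MD" to probabilities; they are position-independent here,
    which simplifies the per-column transitions of a real profile HMM.
    """
    n, m = len(match_score), len(match_score[0])
    S = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STATES}
    lg = math.log2
    best = NEG_INF
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = max(
                0.0,  # assumption: local mode, an alignment may start in any cell
                S["MM"][i - 1][j - 1] + lg(q_tr["MM"] * t_tr["MM"]),
                S["MI"][i - 1][j - 1] + lg(q_tr["MM"] * t_tr["IM"]),
                S["IM"][i - 1][j - 1] + lg(q_tr["IM"] * t_tr["MM"]),
                S["DG"][i - 1][j - 1] + lg(q_tr["DM"] * t_tr["MM"]),
                S["GD"][i - 1][j - 1] + lg(q_tr["MM"] * t_tr["DM"]),
            )
            S["MM"][i][j] = match_score[i - 1][j - 1] + prev + shift
            # MI and DG advance the query index only.
            S["MI"][i][j] = max(S["MM"][i - 1][j] + lg(q_tr["MM"] * t_tr["MI"]),
                                S["MI"][i - 1][j] + lg(q_tr["MM"] * t_tr["II"]))
            S["DG"][i][j] = max(S["MM"][i - 1][j] + lg(q_tr["MD"]),
                                S["DG"][i - 1][j] + lg(q_tr["DD"]))
            # IM and GD are the mirror images and advance the template index only.
            S["IM"][i][j] = max(S["MM"][i][j - 1] + lg(q_tr["MI"] * t_tr["MM"]),
                                S["IM"][i][j - 1] + lg(q_tr["II"] * t_tr["MM"]))
            S["GD"][i][j] = max(S["MM"][i][j - 1] + lg(t_tr["MD"]),
                                S["GD"][i][j - 1] + lg(t_tr["DD"]))
            best = max(best, S["MM"][i][j])
    return best

# Toy example (hypothetical numbers): two columns that match along the diagonal.
tr = {"MM": 0.9, "MI": 0.05, "MD": 0.05, "IM": 0.5, "II": 0.5, "DM": 0.5, "DD": 0.5}
scores = [[2.0, -5.0], [-5.0, 2.0]]
print(viterbi_best_score(scores, tr, tr))
```

In the toy example the best path is the diagonal: two match-match cells joined by one match-to-match transition in each HMM.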

The difference between Equation (3) above and the
default one in HHsuite is that two new terms ($S_{sa}$,
$S_{tors}$) were added to utilize the solvent accessibility
and torsion angle information. In Equation (3),
$S_{ss}(q_i, t_j)$ is the secondary structure score between
column $i$ in the query HMM ($q_i$) and column $j$ in the
template HMM ($t_j$), which is the same as the one
originally used in HHsuite. $S_{sa}(q_i, t_j)$ is the
solvent accessibility score between $q_i$ and $t_j$, and
$S_{tors}(q_i, t_j)$ is the torsion angle score between
$q_i$ and $t_j$; these are the new terms introduced in this
work. $w_{ss}$, $w_{sa}$, and $w_{tors}$ are the weights for
the secondary structure score, solvent accessibility score
and torsion angle score, respectively. $S_{shift}$ is the
score offset for match-match states. The three weights
$w_{ss}$, $w_{sa}$, $w_{tors}$ and the shift score
$S_{shift}$ are set to 0.11, 0.72, 0.4 and -0.03 by default,
and can be adjusted by users as well. $q_{i-1}(M,M)$ is the
transition probability from state M at column $i-1$ to the
next state M in the query HMM, and $t_{j-1}(M,M)$ is the
transition probability from state M at column $j-1$ to the
next state M in the template HMM.
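The column-dependent part of the match-match score, i.e. everything in Equation (3) except the maximum over predecessor states, is a weighted sum that can be written down directly. The sketch below uses the default weights quoted above; the names `Weights` and `augmented_match_score` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Weights:
    """Default weights and shift quoted in the text."""
    w_ss: float = 0.11
    w_sa: float = 0.72
    w_tors: float = 0.4
    shift: float = -0.03

def augmented_match_score(s_aa, s_ss, s_sa, s_tors, weights=Weights()):
    """Column score plus weighted structural terms plus the match-match shift."""
    return (s_aa
            + weights.w_ss * s_ss
            + weights.w_sa * s_sa
            + weights.w_tors * s_tors
            + weights.shift)

# Example: matching accessibility states (S_sa = 1) and similar torsion angles (S_tors = 0.8).
print(augmented_match_score(s_aa=1.0, s_ss=0.5, s_sa=1.0, s_tors=0.8))
```

Setting `w_sa` and `w_tors` to zero gives a score built from the column score, the secondary structure term and the shift only.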

Here we denote this extension of the HHsuite method
as HMMsato. HMMsato allows for scoring predicted (or
known) solvent accessibilities of one protein against
predicted (or known) ones of another protein. DSSP [20] is
used to parse the true solvent accessibility of a protein if
its tertiary structure is known. PSpro 2.0 [21] is used to
predict the solvent accessibility of a protein. The solvent
accessibility information can be automatically parsed or
predicted in HMMsato, or alternatively provided by a user.
Two solvent accessibility states are employed (e: exposed,
at least 25% of the maximum area of a residue is exposed;
b: buried, less than 25% of the maximum area of a residue
is exposed). Assuming the predicted or true solvent
accessibility states of the i-th residue ($x_i$) of the
query protein and the j-th residue ($y_j$) of the template
protein are $sa(x_i)$ and $sa(y_j)$, the solvent
accessibility score between the two residues,
$S_{sa}(q_i, t_j)$, is defined as:

$$S_{sa}(q_i, t_j) = \delta\big(sa(x_i), sa(y_j)\big) \qquad (6)$$

The score is calculated by the Kronecker delta function
$\delta(a, b)$, which equals 1 if $a = b$ and 0 otherwise.
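The two-state encoding and the Kronecker delta score translate into a few lines of code. This is a minimal sketch; the function names and the representation of relative accessibility as a fraction between 0 and 1 are assumptions of the example.

```python
def accessibility_state(relative_accessibility, threshold=0.25):
    """Map relative solvent accessibility (fraction of the maximum area) to 'e' or 'b'."""
    return "e" if relative_accessibility >= threshold else "b"

def solvent_accessibility_score(state_q, state_t):
    """Kronecker delta on the two states: 1 if they agree, 0 otherwise."""
    return 1 if state_q == state_t else 0

query_state = accessibility_state(0.40)      # exposed
template_state = accessibility_state(0.10)   # buried
print(solvent_accessibility_score(query_state, template_state))
```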

As with the solvent accessibility, the torsion angles,
including both the phi angle ($\varphi$) and the psi angle
($\psi$), can be automatically predicted by SPINE-X [22,23]
or provided by a user. The range of both $\varphi$ and
$\psi$ is (-180, 180) degrees. Given the query sequence X
and template sequence Y, the predicted phi and psi angles
of the i-th residue $x_i$ in the query are denoted as
$\varphi(x_i)$ and $\psi(x_i)$, and those of the j-th
residue $y_j$ in the template as $\varphi(y_j)$ and
$\psi(y_j)$. The torsion angle score $S_{tors}(q_i, t_j)$
between the two residues is calculated as:

$$S_{tors}(q_i, t_j) = 1 - 0.5 \cdot \frac{|\varphi(x_i) - \varphi(y_j)| + |\psi(x_i) - \psi(y_j)|}{180} \qquad (7)$$
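A minimal sketch of the torsion angle score follows. It assumes that Equation (7) takes one minus half of the summed absolute phi and psi differences divided by 180, with angles in degrees; the function name is hypothetical.

```python
def torsion_score(phi_q, psi_q, phi_t, psi_t):
    """Torsion similarity in [-1, 1]; angles are given in degrees."""
    return 1.0 - 0.5 * (abs(phi_q - phi_t) + abs(psi_q - psi_t)) / 180.0

print(torsion_score(-60.0, -45.0, -60.0, -45.0))   # identical angles
print(torsion_score(-60.0, -45.0, -60.0, 135.0))   # psi differs by 180 degrees
```

Under this reading the score is 1 for identical angles and falls to -1 when both angles differ by 360 degrees. The differences are not wrapped around at the ±180 boundary, so two angles just on either side of it count as far apart.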



Realign the profiles by maximum accuracy alignment 
combining solvent accessibility and torsion angles 

It has been shown that the maximum accuracy (MAC)
algorithm can generally create a more accurate alignment
than the Viterbi algorithm, while the latter can generate
better alignment scores, E-values and probabilities [5,24].
Consequently, the Viterbi algorithm is applied to
compute E-values and scores, and the MAC algorithm is
chosen to generate the final HMM-HMM pairwise
alignment in HMMsato by default.

The maximum accuracy algorithm [5,24] creates the
local alignment that maximizes the sum of probabilities
for each residue pair to be aligned minus a penalty
(mact), i.e.,

$$\arg\max_{\text{alignment}} \sum_{(i,j) \in \text{alignment}} \left( P(q_i^M \diamond t_j^M) - mact \right)$$

where $P(q_i^M \diamond t_j^M)$ represents the posterior
probability of the match state $i$ in HMM q being aligned
to the match state $j$ in HMM t. With the parameter mact,
users can control the alignment greediness, from nearly
global, long alignments (mact = 0) to very precise, short
local alignments (mact ≈ 1). The default value of mact is
set to 0.3501 in HMMsato as in HHsuite. To find the best
MAC alignment path, an optimal sub-alignment score
matrix AS is calculated recursively using the posterior
probability $P(q_i^M \diamond t_j^M)$ as substitution scores:

$$
AS(i,j) = \max \left\{ \begin{array}{l}
P(q_i^M \diamond t_j^M) - mact \\
AS(i-1,j-1) + P(q_i^M \diamond t_j^M) - mact \\
AS(i-1,j) - 0.5 \cdot mact \\
AS(i,j-1) - 0.5 \cdot mact
\end{array} \right. \qquad (8)
$$
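The MAC recursion and its trace-back can be sketched as follows, given a matrix of posterior probabilities. The sketch assumes that the first option of the recursion starts a new local alignment at cell (i, j), as in HHsuite, and that trace-back begins at the highest-scoring cell; the function name and the toy posterior matrix are hypothetical.

```python
def mac_alignment(posterior, mact=0.3501):
    """Return (best score, aligned pairs) from a posterior matrix; pairs are 1-based (i, j)."""
    n, m = len(posterior), len(posterior[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, None
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = posterior[i - 1][j - 1]
            options = [
                (p - mact, "start"),                            # begin a new local alignment
                (score[i - 1][j - 1] + p - mact, "diag"),       # extend with an aligned pair
                (score[i - 1][j] - 0.5 * mact, "up"),           # gap in the template
                (score[i][j - 1] - 0.5 * mact, "left"),         # gap in the query
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    pairs = []
    cell = best_cell
    while cell is not None:
        i, j = cell
        step = move[i][j]
        if step in ("start", "diag"):
            pairs.append((i, j))
        if step == "diag":
            cell = (i - 1, j - 1)
        elif step == "up":
            cell = (i - 1, j)
        elif step == "left":
            cell = (i, j - 1)
        else:  # "start", or a boundary cell with no recorded move
            cell = None
    return best, pairs[::-1]

# Toy posterior matrix (hypothetical): two confident pairs and one weak pair.
toy_posterior = [[0.9, 0.0, 0.0],
                 [0.0, 0.8, 0.0],
                 [0.0, 0.0, 0.1]]
print(mac_alignment(toy_posterior, mact=0.3))
```

The weak third pair (posterior 0.1) lies below the penalty of 0.3, so the alignment stops after the second pair. Lowering mact towards 0 would extend the alignment through it.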



Here, the Forward-Backward algorithm in local or
global mode is applied to calculate the posterior
probabilities $P(q_i^M \diamond t_j^M)$. The Forward
partition function $F_{MM}(i,j)$ and the Backward partition
function $B_{MM}(i,j)$ are introduced to calculate the
posterior probability for the pair state $(q_i^M, t_j^M)$
according to Equation (9):

$$P(q_i^M \diamond t_j^M) = \frac{F_{MM}(i,j)\, B_{MM}(i,j)}{1 + \sum_{i',j'} F_{MM}(i',j')} \qquad (9)$$
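Once the Forward and Backward match-match matrices are filled, the posterior matrix follows by element-wise multiplication and a single normalization. The sketch below assumes the normalizer is one plus the sum of all Forward match-match entries; the function name and the toy matrices are hypothetical.

```python
def posterior_matrix(forward_mm, backward_mm):
    """Element-wise F_MM * B_MM divided by 1 + sum of all F_MM entries."""
    z = 1.0 + sum(sum(row) for row in forward_mm)
    return [[f * b / z for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward_mm, backward_mm)]

# Toy matrices (hypothetical numbers).
F = [[1.0, 0.0], [0.0, 2.0]]
B = [[2.0, 1.0], [1.0, 1.0]]
print(posterior_matrix(F, B))
```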



Five dynamic programming matrices $F_{AB}$, with
$AB \in \{MM, MI, IM, DG, GD\}$, are used to compute the
Forward partition function $F_{MM}$. The top row and left
column of the $F_{MM}$ matrix are initialized to 0, and all
the matrices are filled recursively:

$$
\begin{aligned}
F_{MM}(i,j) = {} & S_{aa}(q_i,t_j) \cdot 2^{w_{sa} S_{sa}(q_i,t_j)} \cdot 2^{w_{tors} S_{tors}(q_i,t_j)} \\
& \times \big( p_{min} + F_{MM}(i-1,j-1)\, q_{i-1}(M,M)\, t_{j-1}(M,M) \\
& \quad + F_{MI}(i-1,j-1)\, q_{i-1}(M,M)\, t_{j-1}(I,M) \\
& \quad + F_{IM}(i-1,j-1)\, q_{i-1}(I,M)\, t_{j-1}(M,M) \\
& \quad + F_{DG}(i-1,j-1)\, q_{i-1}(D,M)\, t_{j-1}(M,M) \\
& \quad + F_{GD}(i-1,j-1)\, q_{i-1}(M,M)\, t_{j-1}(D,M) \big) \\
F_{MI}(i,j) = {} & F_{MM}(i-1,j)\, q_{i-1}(M,M)\, t_j(M,I) + F_{MI}(i-1,j)\, q_{i-1}(M,M)\, t_j(I,I) \\
F_{DG}(i,j) = {} & F_{MM}(i-1,j)\, q_{i-1}(M,D) + F_{DG}(i-1,j)\, q_{i-1}(D,D)
\end{aligned}
\qquad (10)
$$

where $p_{min}$ controls the alignment mode (0: global
alignment mode, 1: local alignment mode). $F_{IM}(i,j)$
and $F_{GD}(i,j)$ are calculated similarly to $F_{MI}(i,j)$
and $F_{DG}(i,j)$. The solvent accessibility score
$S_{sa}(q_i, t_j)$ and the torsion angle score
$S_{tors}(q_i, t_j)$ are calculated as in the Viterbi
alignment.

In analogy to the Forward partition function, the
Backward partition function matrix $B_{MM}$ is calculated
recursively as follows:

$$
\begin{aligned}
B_{MM}(i,j) = {} & p_{min} + B_{MM}(i+1,j+1)\, S_{aa}(q_{i+1},t_{j+1}) \cdot 2^{w_{sa} S_{sa}(q_{i+1},t_{j+1})} \cdot 2^{w_{tors} S_{tors}(q_{i+1},t_{j+1})}\, q_i(M,M)\, t_j(M,M) \\
& + B_{GD}(i,j+1)\, t_j(M,D) \\
& + B_{IM}(i,j+1)\, q_i(M,I)\, t_j(M,M) \\
& + B_{DG}(i+1,j)\, q_i(M,D) \\
& + B_{MI}(i+1,j)\, q_i(M,M)\, t_j(M,I) \\
B_{MI}(i,j) = {} & B_{MM}(i+1,j+1)\, S_{aa}(q_{i+1},t_{j+1}) \cdot 2^{w_{sa} S_{sa}(q_{i+1},t_{j+1})} \cdot 2^{w_{tors} S_{tors}(q_{i+1},t_{j+1})}\, q_i(M,M)\, t_j(I,M) \\
& + B_{MI}(i+1,j)\, q_i(M,M)\, t_j(I,I) \\
B_{DG}(i,j) = {} & B_{MM}(i+1,j+1)\, S_{aa}(q_{i+1},t_{j+1}) \cdot 2^{w_{sa} S_{sa}(q_{i+1},t_{j+1})} \cdot 2^{w_{tors} S_{tors}(q_{i+1},t_{j+1})}\, q_i(D,M)\, t_j(M,M) \\
& + B_{DG}(i+1,j)\, q_i(D,D)
\end{aligned}
\qquad (11)
$$

$B_{IM}(i,j)$ and $B_{GD}(i,j)$ are calculated similarly to
$B_{MI}(i,j)$ and $B_{DG}(i,j)$.

Figure 2 Tracing back from the AS matrix by integrating the evolutionary
coupling information. In query q, the coupled position of $i$ is $k_q(i)$
and that of $i-1$ is $k_q(i-1)$. In template t, the coupled position of $j$
is $k_t(j)$ and that of $j-1$ is $k_t(j-1)$. $M_q(i)$ is the corresponding
position in template t matched to position $i$ in q during the original
tracing-back. $M_t(j)$ is the corresponding position in query q matched to
position $j$ in t during the original tracing-back. Additional EC scores are
added into the corresponding elements in the AS matrix as shown in the
figure so that the correct tracing back is performed.

Table 1 The mean SP and TC scores of the pairwise alignments generated by
HHsearch 1.2, HHsuite and HMMsato on the CASP9 test data set consisting of
1,138 pairs of proteins

| Method | Mean SP score | Mean TC score |
|---|---|---|
| HHsearch (without secondary structure information) | 48.69 | 48.34 |
| HHsearch (with secondary structure information) | 50.00 | 49.65 |
| HHsuite (without secondary structure information) | 48.47 | 48.12 |
| HHsuite (with secondary structure information) | 49.76 | 49.41 |
| HMMsato | 50.39 | 50.02 |

The highest score in each column is that of HMMsato.

Table 2 The average TM-scores and GDT-TS scores of the 3D models generated
from the 1,127 pairwise test alignments produced by HHsearch 1.2, HHsuite
and HMMsato

| Method | Average TM-score | Average GDT-TS score |
|---|---|---|
| HHsearch (without secondary structure information) | 0.527 | 0.459 |
| HHsearch (with secondary structure information) | 0.548 | 0.479 |
| HHsuite (without secondary structure information) | 0.525 | 0.459 |
| HHsuite (with secondary structure information) | 0.543 | 0.476 |
| HMMsato | 0.555 | 0.483 |

The highest score in each column is that of HMMsato.

Table 3 The statistical significance (p-values) of SP and TC score
differences between HMMsato and the other two tools on the test data set

| Tools | p-value of SP scores | p-value of TC scores |
|---|---|---|
| HMMsato - HHsearch (without secondary structure information) | 1.078 × 10^-6 | 3.414 × 10^-7 |
| HMMsato - HHsearch (with secondary structure information) | 0.7538 | 0.8082 |
| HMMsato - HHsuite (without secondary structure information) | 1.724 × 10^-8 | 1.515 × 10^-9 |
| HMMsato - HHsuite (with secondary structure information) | 0.1535 | 0.1087 |

Trace back maximum accuracy alignments with the 
evolutionary residue coupling information 

Evolutionary coupling (EC) refers to the correlation
between two positions or columns in a multiple
protein sequence alignment or a protein profile [19,20].
It has recently been employed to predict residue-residue
contacts [18,19]. In order to improve profile-profile
alignment with the evolutionary coupling information,
we calculate the mutual information (MI) (one way of
calculating the EC value) for any two columns $(i, j)$ of
each profile according to Equation (12).



$$EC_{ij} = \sum_{X_i, X_j = 1}^{N} F_{ij}(X_i, X_j) \ln \frac{F_{ij}(X_i, X_j)}{F_i(X_i)\, F_j(X_j)} \qquad (12)$$






N is 21, standing for the 20 amino acids plus gap. The
joint probability of two residues $X_i$ and $X_j$,
$F_{ij}(X_i, X_j)$, and the probability of residue $X_i$,
$F_i(X_i)$, are calculated in the same way as in [10].
However, $EC_{ij}$ is calculated as the mutual information
(MI) instead of the direct information (DI) based on the
global probability model [19] in order to achieve higher
time efficiency. A higher EC value corresponds to a
stronger correlation between two columns in the given
profile.
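For illustration, the mutual information between two alignment columns can be computed from raw symbol counts as sketched below. This is a simplification: the paper derives the frequencies as in the cited reference, which may involve sequence weighting and pseudocounts, whereas this sketch uses unweighted counts. The function name and the toy columns are hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (natural log) between two equally long alignment columns."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

# Toy columns over four sequences; '-' would be the 21st symbol (gap).
print(mutual_information("AACC", "GGTT"))  # perfectly coupled columns
print(mutual_information("AACC", "GTGT"))  # independent columns
```

Perfectly coupled two-symbol columns give ln 2, and independent columns give 0.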

Table 4 The SP scores and TC scores with different values of $w_{sa}$ using
HMMsato on the training data

| $w_{sa}$ | SP score | TC score |
|---|---|---|
| 0 | 40.?? | 40.58 |
| 0.1 | 41.58 | 41.25 |
| 0.2 | 41.82 | 41.49 |
| 0.3 | 41.9? | 41.58 |
| 0.4 | 42.0? | 41.73 |
| 0.5 | 42.18 | 41.85 |
| 0.6 | 42.2? | 41.90 |
| 0.61 | 42.18 | 41.85 |
| 0.62 | 42.20 | 41.87 |
| 0.63 | 42.19 | 41.86 |
| 0.64 | 42.22 | 41.89 |
| 0.65 | 42.22 | 41.89 |
| 0.66 | 42.23 | 41.90 |
| 0.67 | 42.23 | 41.90 |
| 0.68 | 42.25 | 41.92 |
| 0.69 | 42.24 | 41.91 |
| 0.7 | 42.29 | 41.96 |
| 0.71 | 42.29 | 41.96 |
| 0.72 | 42.31* | 41.98* |
| 0.73 | 42.27 | 41.94 |
| 0.74 | 42.29 | 41.96 |
| 0.75 | 42.27 | 41.94 |
| 0.76 | 42.28 | 41.95 |
| 0.77 | 42.27 | 41.94 |
| 0.78 | 42.28 | 41.94 |
| 0.79 | 42.27 | 41.94 |
| 0.8 | 42.25 | 41.91 |
| 0.9 | 42.24 | 41.91 |
| 1 | 42.20 | 41.87 |

An asterisk denotes the highest score. A question mark stands for a digit
that is illegible in the source.

Based on the calculated EC value matrices for both the
query and template profiles, the most highly correlated
position pairs (those with the highest EC values) in each
profile are selected. The evolutionary residue coupling
information is then applied to check the counterpart pairs
during the process of tracing back through the
sub-alignment score matrix AS (see Equation (8)) of the
MAC alignment. Specifically, we denote the evolutionarily
coupled position for position $i$ in query q as $k_q(i)$,
and the coupled position of position $j$ in template t as
$k_t(j)$. Moreover, $M_q(i)$ denotes the position in
template t matched with position $i$ in query q when
tracing back the original AS matrix, $M_t(j)$ denotes the
position in query q matched with position $j$ in
template t when tracing back the original AS matrix, and
$w_{ec}$ is the weight for the evolutionary coupling
information. The new AS matrix integrating the
evolutionary coupling information is recalculated as
follows during the trace-back process.

$$
\begin{aligned}
AS'(i,j) &= AS(i,j) + w_{ec} \big( EC(i, M_t(k_t(j))) + EC(M_q(k_q(i)), j) \big) \\
AS'(i,j-1) &= AS(i,j-1) + w_{ec} \big( EC(i, M_t(k_t(j-1))) + EC(M_q(k_q(i)), j-1) \big) \\
AS'(i-1,j-1) &= AS(i-1,j-1) + w_{ec} \big( EC(i-1, M_t(k_t(j-1))) + EC(M_q(k_q(i-1)), j-1) \big) \\
AS'(i-1,j) &= AS(i-1,j) + w_{ec} \big( EC(i-1, M_t(k_t(j))) + EC(M_q(k_q(i-1)), j) \big)
\end{aligned}
\qquad (13)
$$

Figure 2 illustrates an example of taking the evolutionary
coupling information into account during the trace-back
process to generate the final alignment.
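The coupling bonus added to one cell of the AS matrix can be sketched as below. Several choices are assumptions of the sketch rather than statements from the text: the first EC term is read from the query profile's EC matrix and the second from the template profile's, coupled positions and matches are stored in dictionaries, and positions without a coupled partner or without a match contribute nothing. All names are hypothetical.

```python
def ec_bonus(i, j, ec_query, ec_template, k_q, k_t, m_q, m_t, w_ec=0.1):
    """Weighted coupling bonus added to AS(i, j) during trace-back.

    ec_query / ec_template: square EC matrices (lists of lists) of the two profiles.
    k_q / k_t: dicts mapping a position to its coupled position in the same profile.
    m_q: dict query position -> template position from the original trace-back.
    m_t: dict template position -> query position from the original trace-back.
    Missing entries (uncoupled or unaligned positions) contribute 0 (assumption).
    """
    bonus = 0.0
    q_partner = m_t.get(k_t.get(j))         # query position matched to the partner of j
    if q_partner is not None:
        bonus += ec_query[i][q_partner]     # EC(i, M_t(k_t(j)))
    t_partner = m_q.get(k_q.get(i))         # template position matched to the partner of i
    if t_partner is not None:
        bonus += ec_template[t_partner][j]  # EC(M_q(k_q(i)), j)
    return w_ec * bonus

# Toy example (hypothetical): positions 0 and 2 are coupled in both profiles
# and were matched to each other in the original trace-back.
ec_q = [[0.0] * 4 for _ in range(4)]
ec_t = [[0.0] * 4 for _ in range(4)]
ec_q[0][2] = 0.8
ec_t[2][0] = 0.6
k_q = {0: 2, 2: 0}
k_t = {0: 2, 2: 0}
m_q = {0: 0, 2: 2}
m_t = {0: 0, 2: 2}
print(ec_bonus(0, 0, ec_q, ec_t, k_q, k_t, m_q, m_t))
```

The same function would be evaluated for the cells (i, j), (i, j-1), (i-1, j-1) and (i-1, j) before the trace-back chooses its next step.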

Results and discussion 

Evaluation data set and metric 

We evaluated HMMsato along with HHsearch [4] and
HHsuite on the alignments between 106 targets (queries)
of the 9th Critical Assessment of Techniques for Protein
Structure Prediction (CASP9) [15,16] and their homologous
template proteins (templates) released at the CASP9
web site. The alignment data set has 2,621 pairs of query
and template proteins. 1,483 pairs associated with 60
CASP9 targets were used as the optimization (training)
data set to optimize the parameters of HMMsato, and
1,138 pairs associated with the remaining 46 CASP9
targets were used to test the methods. The reference
(presumably true) pairwise alignment of a query-template
protein pair was generated by using TMalign [25] to align
the tertiary (3D) structures of the two proteins. The
alignments generated by HMMsato and other tools were
evaluated by three metrics: the sum-of-pairs (SP) score,
the true column (TC) score, and the quality of the
tertiary structural models of the query proteins built
from the alignments. The SP and TC scores are the two
standard metrics for evaluating sequence alignment
quality [26]. The quality of tertiary structural models
indirectly assesses the quality of sequence alignments
according to their effectiveness in guiding the
construction of protein structural models.

Table 5 The SP scores and TC scores with different values of $w_{tors}$
using HMMsato

| $w_{tors}$ | SP score | TC score |
|---|---|---|
| 0 | 42.31 | 41.98 |
| 0.1 | 42.32 | 41.99 |
| 0.2 | 42.35 | 42.02 |
| 0.3 | 42.45 | 42.12 |
| 0.31 | 42.47 | 42.14 |
| 0.32 | 42.47 | 42.14 |
| 0.33 | 42.47 | 42.14 |
| 0.34 | 42.49 | 42.16 |
| 0.35 | 42.50 | 42.16 |
| 0.36 | 42.50 | 42.17 |
| 0.37 | 42.51 | 42.17 |
| 0.38 | 42.50 | 42.17 |
| 0.39 | 42.51 | 42.18 |
| 0.4 | 42.53* | 42.19* |
| 0.41 | 42.52 | 42.19 |
| 0.42 | 42.49 | 42.15 |
| 0.43 | 42.50 | 42.16 |
| 0.44 | 42.50 | 42.17 |
| 0.45 | 42.51 | 42.17 |
| 0.46 | 42.51 | 42.17 |
| 0.47 | 42.50 | 42.16 |
| 0.48 | 42.50 | 42.17 |
| 0.49 | 42.50 | 42.17 |
| 0.5 | 42.50 | 42.17 |
| 0.6 | 42.46 | 42.13 |
| 0.7 | 42.45 | 42.12 |
| 0.8 | 42.40 | 42.07 |
| 0.9 | 42.46 | 42.13 |
| 1 | 42.40 | 42.07 |

An asterisk denotes the highest score.

Figure 4 The plot of the TM-scores and GDT-TS scores against different
values of the weight of torsion angles ($w_{tors}$).

The SP score is the number of correctly aligned pairs
of residues in the predicted alignment divided by the
total number of aligned pairs of residues in the core
blocks (i.e., sequence alignment regions precisely
determined by structural alignment of structurally
equivalent residues in the structures of two proteins) of
the true alignment [23]. The TC score is the number of
correctly aligned columns in the predicted alignment
divided by the total number of aligned columns in the
core blocks of the true alignment [27]. The 3D model of a
query protein was produced by MODELLER [28] based on both
the pairwise alignment generated by an alignment method
and the known structure of the template protein in the
alignment. We used TM-Score [29] to align a 3D model of a
query protein against its true structure to generate
TM-scores and GDT-TS scores [30] for the model in order
to measure the quality of the alignment used to generate
the model, assuming better alignments lead to better 3D
models with higher TM-scores and GDT-TS scores. Both
TM-score and GDT-TS score are in the range [0, 1] [31].
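The SP score of a pairwise alignment can be sketched as follows, representing each alignment as a collection of (query position, template position) pairs. Reporting the score on a 0-100 scale is an assumption inferred from the magnitude of the values in the tables, and the function name is hypothetical.

```python
def sp_score(predicted_pairs, reference_pairs):
    """Percentage of reference (core-block) residue pairs reproduced by the prediction."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("the reference alignment has no aligned pairs")
    correct = len(reference & set(predicted_pairs))
    return 100.0 * correct / len(reference)

# Toy alignments (hypothetical): two of four reference pairs are recovered.
reference = [(1, 1), (2, 2), (3, 3), (4, 4)]
predicted = [(1, 1), (2, 2), (3, 4)]
print(sp_score(predicted, reference))
```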



Optimization of weights for the solvent accessibility, 
torsion angles and evolutionary coupling information 

We estimated the weights of the solvent accessibility,
torsion angles and evolutionary residue coupling
information on the training alignments step by step.
First, we found the best weight value ($w_{sa}$ = 0.72)
for solvent accessibility. Then, we identified the best
weight value ($w_{tors}$ = 0.4) for torsion angles while
keeping the weight for solvent accessibility fixed.
Finally, we found the best parameter value
($w_{ec}$ = 0.1) for the evolutionary residue coupling
information by keeping $w_{sa}$ and $w_{tors}$ at their
optimum values. HHsearch and HHsuite were both evaluated
with and without secondary structure information. The
default parameter values were used with HHsearch and
HHsuite.
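The step-by-step procedure amounts to a one-pass coordinate-wise grid search: each weight is tuned in turn while the weights tuned earlier stay fixed at their best values. A minimal sketch follows; the function name, the grids and the toy objective are hypothetical, and in practice `evaluate` would run the aligner on the training pairs and return, for example, the mean SP score.

```python
def stepwise_weight_search(evaluate, grids, start):
    """Tune one weight at a time, in the order given by `grids`."""
    weights = dict(start)
    for name, grid in grids.items():      # dicts preserve insertion order
        best_value, best_score = None, float("-inf")
        for value in grid:
            trial = dict(weights, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        weights[name] = best_value
    return weights

# Toy objective (hypothetical) with its maximum at w_sa = 0.7, w_tors = 0.4, w_ec = 0.1.
def toy_evaluate(w):
    return -((w["w_sa"] - 0.7) ** 2 + (w["w_tors"] - 0.4) ** 2 + (w["w_ec"] - 0.1) ** 2)

grid = [round(0.1 * k, 1) for k in range(11)]
grids = {"w_sa": grid, "w_tors": grid, "w_ec": grid}
start = {"w_sa": 0.0, "w_tors": 0.0, "w_ec": 0.0}
print(stepwise_weight_search(toy_evaluate, grids, start))
```

A single pass of this kind does not explore interactions between the weights, so it may miss the joint optimum; a full grid over all weights would be the more exhaustive alternative.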



Comparison of HMMsato, HHsearch, and HHsuite on the
test data set

The mean SP and TC scores for the pairwise alignments
generated by HMMsato, HHsearch and HHsuite for the
1,138 protein pairs are reported in Table 1. The mean SP
score and the mean TC score of HMMsato are 50.39 and
50.02, respectively, higher than those of HHsearch and
HHsuite with or without secondary structure information.
The average TM-scores and GDT-TS scores of the 3D models
successfully generated from 1,127 out of 1,138 alignments
by MODELLER are listed in Table 2. The average TM-score
and GDT-TS score of the models generated from the
HMMsato alignments are 0.555 and 0.483, respectively,
better than those of HHsearch and HHsuite. Furthermore,
we carried out the Wilcoxon matched-pair signed-rank test
on both SP and TC scores of the three methods on the
test data set. The p-values of alignment score
differences between HMMsato and the other methods
calculated by the Wilcoxon matched-pair signed-rank test
are reported in Table 3. The differences between HMMsato
and the two tools run without secondary structure
information are significant (p < 10^-5), whereas the
differences from the two tools run with secondary
structure information are not significant at the 0.05
level (p-values between 0.1087 and 0.8082).

Figure 5 The plot of the SP score differences between HMMsato and HHsearch
with secondary structure (HHsearch-SS) for all the 1,138 testing pairs.
X-axis represents the index of the testing pair (1-1138), and y-axis
represents the SP score difference (the SP score of HMMsato - the SP score
of HHsearch-SS) for all the testing pairs.



Impact of solvent accessibility, torsion angles and 
evolutionary coupling information on the alignment 
accuracy 

We studied the effect of the solvent accessibility
information by solely adjusting the value of its weight
$w_{sa}$. The SP scores and TC scores of the alignments
generated by HMMsato with different $w_{sa}$ values on the
training data set are shown in Table 4. The results show
that every tested non-zero value of $w_{sa}$ improves
alignment accuracy in comparison with the baseline not
using solvent accessibility information ($w_{sa}$ = 0).
The highest accuracy is achieved when $w_{sa}$ is set to
0.72, where the TC score rises from 40.58 to 41.98.
Figure 3 shows the plot of SP scores/TC scores against the
different values of $w_{sa}$. The red curve represents the
SP scores and the blue curve represents the TC scores.

We studied the effect of torsion angles on alignments by
solely adjusting the value of $w_{tors}$ (the weight for
torsion angle information) while keeping $w_{sa}$ at 0.72.
The SP scores and TC scores of the alignments generated by
HMMsato with different $w_{tors}$ values on the training
data set are shown in Table 5. The results show that
incorporating the torsion angle information also helps
improve alignment accuracy, although the gain is smaller.
The highest accuracy is achieved when $w_{tors}$ is set to
0.4, where the SP score rises from 42.31 to 42.53.
Figure 4 shows the TM-scores and GDT-TS scores of the
3D models constructed from the alignments generated by
HMMsato with both torsion angles and solvent
accessibility with respect to different $w_{tors}$ values.



The effect of evolutionary residue coupling information 
on alignment accuracy 

We studied the effect of the evolutionary residue
coupling information on alignment accuracy in a similar
way. HMMsato worked the best when $w_{ec}$ was 0.1.
However, the evolutionary coupling information did not
improve the overall alignment accuracy on the training
data set, probably because many cases lack the large
number of diverse sequences that the evolutionary
coupling calculation requires to obtain sufficient
discriminative power. Specifically, the alignment quality
increased in 57 alignments, stayed the same in 1363
alignments, but decreased in 61 alignments. Similarly, on
the test data set, the alignment quality increased in 59
alignments, stayed the same in 1024 alignments, but
decreased in 55 alignments. Overall, the evolutionary
coupling information contributed to the improvement of
alignment accuracy in some cases, but its effect was
rather inconsistent.
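The per-alignment bookkeeping reported above (increased, unchanged, decreased) can be sketched as a simple tally over paired scores. The function name, the tolerance and the toy scores are hypothetical.

```python
def tally_changes(baseline_scores, modified_scores, tolerance=1e-9):
    """Count alignments whose score increased, stayed the same, or decreased."""
    if len(baseline_scores) != len(modified_scores):
        raise ValueError("score lists must have the same length")
    increased = sum(1 for b, m in zip(baseline_scores, modified_scores) if m - b > tolerance)
    decreased = sum(1 for b, m in zip(baseline_scores, modified_scores) if b - m > tolerance)
    unchanged = len(baseline_scores) - increased - decreased
    return increased, unchanged, decreased

# Toy SP scores (hypothetical) without and with the coupling term.
without_ec = [50.0, 40.0, 30.0, 20.0]
with_ec = [55.0, 40.0, 25.0, 20.0]
print(tally_changes(without_ec, with_ec))
```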
Figure 6 The plot of the average SP score difference between HMMsato and
HHsearch-SS for the 46 testing protein targets. X-axis represents the index
of the testing targets (1-46), and y-axis represents the score difference.

Comparison of HMMsato and HHsearch with secondary
structure information on the test data set

We studied the SP score differences between HMMsato
and HHsearch with secondary structure (HHsearch-SS) for
all the 1,138 testing pairs. The plot of the SP score
difference (SP score of HMMsato minus SP score of
HHsearch-SS) for these pairs is shown in Figure 5.
Similarly, the plot of the average SP score difference
between HMMsato and HHsearch-SS for the 46 testing
protein targets is shown in Figure 6. X-axis represents
the index of the testing targets (1-46), and y-axis
represents the score difference. Specifically, the
alignment quality increased for 24 targets, stayed the
same for 2 targets, but decreased for 20 targets. We
found that HMMsato often improved the alignment quality
for proteins of length ranging from 70 to 450 residues.

Conclusion 

We designed a method to incorporate relative solvent
accessibility, torsion angles and evolutionary residue
coupling information into HMM-based pairwise
profile-profile protein alignments. Our experiments on the
large CASP9 alignment data set showed that utilizing
solvent accessibility and torsion angles improved the
accuracy of HMM-based pairwise profile-profile alignments.
However, the effect of the evolutionary residue coupling
information on alignments is less consistent in our
current experimental setting, even though it may still be
a valuable source of information to explore in the
future. In particular, we will use the latest method
(i.e., direct information) of calculating evolutionary
coupling information to guide the profile alignment
process. Furthermore, we will carry out a more extensive
search for optimal weights for solvent accessibility,
torsion angle, secondary structure, and evolutionary
coupling information to improve alignment accuracy.